A Compiler-driven Supercomputer
Abstract
The overall performance of supercomputers is slow compared to the speed of their underlying logic technology. This discrepancy is due to several bottlenecks: memories are slower than the CPU, conditional jumps limit the usefulness of pipelining and pre-fetching mechanisms, and functional-unit parallelism is limited by the speed of hardware scheduling. This paper describes a supercomputer architecture called Ring of Pre-fetch Elements (ROPE) that attempts to solve the problems of memory latency and conditional jumps without hardware scheduling. ROPE consists of a heavily pipelined CPU data path with a new instruction pre-fetching mechanism that supports general multi-way conditional jumps. An optimizing compiler based on a global code transformation technique (Percolation Scheduling, or PS) gives high performance without scheduling hardware.

* This work is supported in part by NSF grant DCR-8502884 and the Cornell NSF Supercomputing Center. Copyright 1986 by Elsevier Science Publishing Co., Inc. Applications of Supercomputers.

INTRODUCTION

Traditional computer architectures use resources inefficiently, resulting in machines whose performance is disappointing when compared to the raw speed of their components. The main technique for running processors near the limits of a technology is pipelining. An operation in a pipelined machine may take several cycles to complete, but a new operation can be started on each cycle, so the throughput remains high (the throughput model following this introduction makes the gain precise). The benefits of pipelining, however, have been limited by the difficulty of keeping the pipeline full. The difficulty can be traced to two sources: data dependencies and the slowness of memory.

A data dependency is a relationship between two instructions that use a common register or memory location. The second operation cannot begin until the first operation has finished using the register or memory (the first code sketch below shows such a dependency, and the last sketches the compile-time code motion that can replace scheduling hardware). In many supercomputer architectures, complex scheduling hardware is used to keep the nearly independent processing units from violating the data dependencies implicit in the code. Although scheduling hardware allows some overlapping of normally sequential operations, the resulting machine is only about twice as fast as a strictly sequential machine. Even with far more sophisticated scheduling hardware than that in current machines, only another factor of one and a half is obtained [18]. The scheduling mechanism is not only expensive to build, but it also slows down the basic cycle time, since it must operate faster than the processing units.

Large memories are slow compared with modern processing elements, limiting the performance of a machine in two ways. First, instructions to be executed must be fetched from memory. Second, data operations that need to read from or write to memory take a long time to complete, delaying other instructions. For straight-line code, conventional pre-fetch units and instruction caches remove most of the instruction-fetch delay, but substantial penalties are incurred for conditional jumps and cache misses. The smallness of basic blocks [17], [12] and the corresponding frequency of jumps have usually limited the size of pipelines to two or three stages [16] (a back-of-envelope penalty estimate appears below).

RISC (Reduced Instruction Set Computer) designs try to gain performance without scheduling hardware by making all instructions take the same time. The more complex operations, such as floating-point arithmetic, have been broken into smaller operations that can be executed quickly. The approach works well for small machines [8], [15], but is unsuitable for high-performance machines.
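The gain from pipelining claimed above can be made concrete with the standard textbook throughput model; the model and symbols below are a hedged illustration, not figures from the paper.

```latex
% Idealized k-stage pipeline with cycle time t_c executing n operations
% (standard textbook model; not from the ROPE paper).
\[
T_{\text{sequential}} = n\,k\,t_c,
\qquad
T_{\text{pipelined}} = \bigl(k + (n-1)\bigr)\,t_c
\]
\[
\text{Speedup} = \frac{n\,k}{k + (n-1)} \;\longrightarrow\; k
\quad\text{as } n \to \infty .
\]
% If a fraction s of cycles stall on dependencies, memory, or jumps,
% throughput falls to (1 - s) operations per cycle -- keeping the
% pipeline full is exactly the problem the introduction describes.
```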
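To make the data-dependency discussion concrete, here is a minimal C sketch (illustrative code, not from the paper; the latencies named in the comments are assumptions) showing a read-after-write dependency and the independent operation a compiler can move into the stall:

```c
/* Read-after-write (RAW) dependency sketch -- illustrative only. */
#include <stdio.h>

int main(void) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {5.0, 6.0, 7.0, 8.0};

    double a = x[0];      /* load: assume several cycles on slow memory  */
    double b = a + 1.0;   /* RAW dependency on a: must wait for the load */

    /* Independent of a and b: a statically scheduling compiler can
     * hoist this multiply between the load and the add, filling the
     * stall cycles with useful work -- no scheduling hardware needed. */
    double c = y[0] * 2.0;

    printf("%.1f %.1f %.1f\n", a, b, c);
    return 0;
}
```

The thesis of a compiler-driven design is that this reordering is done once, at compile time, rather than re-discovered by hardware on every cycle.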
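The claim that short basic blocks cap pipeline depth can be checked with a back-of-envelope calculation; the model and numbers below are illustrative assumptions, not measurements from the paper.

```latex
% Branch-penalty model: a jump every b-th instruction, each costing a
% refill penalty of p cycles (illustrative assumptions).
\[
\text{CPI} \approx 1 + \frac{p}{b}
\]
% Example: basic blocks of about 5 instructions (b = 5) and a deep
% pipeline that loses p = 8 cycles per jump give
\[
\text{CPI} \approx 1 + \frac{8}{5} = 2.6,
\]
% i.e. the machine runs at under 40% of its peak rate -- which is why
% frequent jumps have historically kept pipelines to two or three stages.
```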
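Percolation Scheduling itself, mentioned in the abstract, is defined as a set of semantics-preserving transformations on a program graph; the fragment below is only a hedged C sketch of the kind of motion involved: an operation that does not depend on a conditional test is moved above the jump so it can overlap with it.

```c
/* Percolation-style code motion -- a sketch of the effect, not the
 * PS algorithm itself. */
#include <stdio.h>

int main(void) {
    int n = 3, a = 6, b = 7, r, s;

    /* Before: the multiply sits below the conditional jump, so a
     * pipelined machine drains at the branch before it can start. */
    if (n > 0) { r = a * b; s = r + 1; } else { s = 0; }
    printf("before: s = %d\n", s);

    /* After the motion: the multiply depends only on a and b, and r
     * is dead on the else path, so the operation can legally move
     * above the jump and overlap with its evaluation.  (PS performs
     * such motions on a program graph with explicit dependence
     * checks; this fragment only sketches the outcome.) */
    r = a * b;                      /* hoisted above the jump */
    if (n > 0) { s = r + 1; } else { s = 0; }
    printf("after:  s = %d\n", s);

    return 0;
}
```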
Similar resources
Scientific Flow Field Simulation of Cruciform Missiles Through the Thin Layer Navier Stokes Equations
The thin-layer Navier-Stokes equations are solved for two complete missile configurations on an IBM 3090-200 vector-facility supercomputer. The conservation form of the three-dimensional equations, written in generalized coordinates, is finite-differenced and solved on a body-fitted curvilinear grid system developed in conjunction with the flowfield solver. The numerical procedure is based on ...
Partial Evaluation for Scientific Computing: The Supercomputer Toolkit Experience
We describe the key role played by partial evaluation in the Supercomputer Toolkit, a parallel computing system for scientific applications that effectively exploits the vast amount of parallelism exposed by partial evaluation. The Supercomputer Toolkit parallel processor and its associated partial evaluation-based compiler have been used extensively by scientists at M.I.T., and have made possibl...
Nessie: A NESL to CUDA Compiler
Modern GPUs provide supercomputer-level performance at commodity prices, but they are notoriously hard to program. To address this problem, we have been exploring the use of Nested Data Parallelism (NDP), and specifically the first-order functional language NESL, as a way to raise the level of abstraction for programming GPUs. This paper describes a new compiler for the NESL language that generated...
MDE4HPC: An Approach for Using Model-Driven Engineering in High-Performance Computing
With the increasing number of programming paradigms and hardware architectures, high-performance computing is becoming more and more complex as it tries to exploit supercomputer resources efficiently and sustainably. Our thesis is that Model-Driven Engineering (MDE) can help us deal with this complexity by abstracting away some platform-dependent details. In this paper we present our approach (MDE4HP...